Depth CNNs for RGB-D scene recognition: learning from scratch better than transferring from RGB-CNNs
Scene recognition with RGB images has been extensively studied and has reached remarkable recognition levels, thanks to convolutional neural networks (CNNs) and large scene datasets. In contrast, current RGB-D scene data is much more limited, so approaches often leverage large RGB datasets by transferring pretrained RGB CNN models and fine-tuning them with the target RGB-D dataset. However, we show that this approach hardly reaches the bottom layers, which are key to learning modality-specific features. In contrast, we focus on the bottom layers and propose an alternative strategy that learns depth features by combining local weakly supervised training on patches with subsequent global fine-tuning on full images. This strategy learns very discriminative depth-specific features from limited depth images, without resorting to Places-CNN. In addition, we propose a modified CNN architecture to further match the complexity of the model to the amount of data available. For RGB-D scene recognition, depth and RGB features are combined by projecting them into a common space and further learning a multilayer classifier, which is jointly optimized in an end-to-end network. Our framework achieves state-of-the-art accuracy on NYU2 and SUN RGB-D in both depth-only and combined RGB-D settings.
Comment: AAAI Conference on Artificial Intelligence 201
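The fusion step described above — projecting depth and RGB features into a common space and classifying the fused result — can be sketched with plain NumPy. The layer sizes, ReLU projections, and concatenation-based fusion below are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: per-modality CNN features, shared space, classes.
D_RGB, D_DEPTH, D_COMMON, N_CLASSES = 4096, 4096, 512, 19

# Modality-specific projections into a shared embedding space.
W_rgb = rng.normal(0, 0.01, (D_RGB, D_COMMON))
W_depth = rng.normal(0, 0.01, (D_DEPTH, D_COMMON))

# Two-layer classifier on the fused representation.
W1 = rng.normal(0, 0.01, (2 * D_COMMON, 256))
W2 = rng.normal(0, 0.01, (256, N_CLASSES))

def fuse_and_classify(f_rgb, f_depth):
    """Project each modality into the common space, fuse, and classify."""
    z_rgb = np.maximum(f_rgb @ W_rgb, 0)            # ReLU projection
    z_depth = np.maximum(f_depth @ W_depth, 0)
    z = np.concatenate([z_rgb, z_depth], axis=-1)   # common-space fusion
    h = np.maximum(z @ W1, 0)                       # hidden classifier layer
    return h @ W2                                   # class logits

logits = fuse_and_classify(rng.normal(size=D_RGB), rng.normal(size=D_DEPTH))
print(logits.shape)  # (19,)
```

In the paper the whole pipeline is trained end-to-end, so in practice these weights would be optimized jointly rather than fixed.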
Multifaceted Analysis of Fine-Tuning in Deep Model for Visual Recognition
In recent years, convolutional neural networks (CNNs) have achieved impressive performance across various visual recognition scenarios. CNNs trained on large labeled datasets not only obtain significant performance on the most challenging benchmarks but also provide powerful representations that can be applied to a wide range of other tasks. However, the massive amount of data required to train deep neural networks is a major drawback of these models, as the available data is usually limited or imbalanced. Fine-tuning (FT) is an effective way to transfer knowledge learned on a source dataset to a target task. In this paper, we introduce and systematically investigate several factors that influence the performance of fine-tuning for visual recognition. These factors include parameters of the retraining procedure (e.g., the initial learning rate of fine-tuning), the distribution of the source and target data (e.g., the number of categories in the source dataset, the distance between the source and target datasets), and so on. We quantitatively and qualitatively analyze these factors, evaluate their influence, and present many empirical observations. The results reveal how fine-tuning changes CNN parameters and provide useful, evidence-backed intuitions about how to implement fine-tuning for computer vision tasks.
Comment: Accepted by ACM Transactions on Data Scienc
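One factor this paper studies, the initial learning rate of fine-tuning, is often handled in practice by giving pretrained layers a smaller rate than the freshly initialized head. The sketch below illustrates that common heuristic only; the layer names, the 10x ratio, and the base rate are assumptions, not findings from the paper:

```python
BASE_LR = 0.01  # assumed base learning rate for the new head

def finetune_lr(layer_name, base_lr=BASE_LR):
    """Assign a per-layer learning rate for fine-tuning.

    A freshly initialized classifier head gets the full rate; pretrained
    backbone layers get a 10x smaller rate so transferred weights change
    gently instead of being overwritten early in training.
    """
    if layer_name.startswith("fc_new"):  # hypothetical new classifier head
        return base_lr
    return base_lr * 0.1                 # pretrained backbone layers

lrs = {name: finetune_lr(name) for name in ["conv1", "conv5", "fc_new"]}
```

Deep-learning frameworks typically expose this directly via per-parameter-group optimizer options.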
KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation
Vision-and-language navigation (VLN) is the task of enabling an embodied agent to navigate to a remote location in real scenes by following a natural language instruction. Most previous approaches use entire-view features or object-centric features to represent navigable candidates. However, these representations are not efficient enough for an agent to perform the actions needed to arrive at the target location. As knowledge provides crucial information that is complementary to visible content, in this paper we propose a Knowledge Enhanced Reasoning Model (KERM) that leverages knowledge to improve agent navigation ability. Specifically, we first retrieve facts (i.e., knowledge described by language descriptions) for the navigation views, based on local regions, from a constructed knowledge base. The retrieved facts range from properties of a single object (e.g., color, shape) to relationships between objects (e.g., action, spatial position), providing crucial information for VLN. We further present KERM's purification, fact-aware interaction, and instruction-guided aggregation modules, which integrate visual, history, instruction, and fact features. KERM can automatically select and gather crucial, relevant cues, yielding more accurate action prediction. Experimental results on the REVERIE, R2R, and SOON datasets demonstrate the effectiveness of the proposed method.
Comment: Accepted by CVPR 2023. The code is available at
https://github.com/XiangyangLi20/KER
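The first stage described in this abstract, retrieving language facts for view regions from a knowledge base, can be illustrated with a toy lookup. The knowledge-base entries, region labels, and `top_k` cutoff below are invented for illustration; KERM's actual retrieval operates over learned region and fact representations:

```python
# Toy knowledge base mapping object labels to language-described facts.
KNOWLEDGE_BASE = {
    "sofa": ["sofa is soft", "sofa is in living room"],
    "sink": ["sink is in kitchen", "sink is next to counter"],
    "lamp": ["lamp emits light"],
}

def retrieve_facts(region_labels, kb=KNOWLEDGE_BASE, top_k=2):
    """Collect up to top_k facts for each object detected in a view region.

    Labels absent from the knowledge base simply contribute no facts.
    """
    facts = []
    for label in region_labels:
        facts.extend(kb.get(label, [])[:top_k])
    return facts

facts = retrieve_facts(["sofa", "lamp", "window"])
# -> ['sofa is soft', 'sofa is in living room', 'lamp emits light']
```

The retrieved facts would then be fed, alongside visual, history, and instruction features, into the model's aggregation modules.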
GridMM: Grid Memory Map for Vision-and-Language Navigation
Vision-and-language navigation (VLN) enables an agent to navigate to a remote location in 3D environments by following a natural language instruction. To represent the previously visited environment, most VLN approaches implement memory using recurrent states, topological maps, or top-down semantic maps. In contrast to these approaches, we build a top-down, egocentric, and dynamically growing Grid Memory Map (GridMM) to structure the visited environment. From a global perspective, historical observations are projected into a unified top-down grid map, which better represents the spatial relations of the environment. From a local perspective, we further propose an instruction relevance aggregation method to capture fine-grained visual clues in each grid region. Extensive experiments on the REVERIE, R2R, and SOON datasets in discrete environments, and on the R2R-CE dataset in continuous environments, show the superiority of our proposed method.
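The core projection step described above, scattering historical observations into a top-down egocentric grid, can be sketched with NumPy. The grid size, cell resolution, feature dimension, and per-cell averaging below are illustrative assumptions rather than GridMM's exact design:

```python
import numpy as np

# Assumed layout: 14x14 grid, 0.5 m cells, 8-dim observation features.
GRID, CELL, FEAT = 14, 0.5, 8

def build_grid_map(positions, features, grid=GRID, cell=CELL):
    """Scatter agent-relative observations into a top-down egocentric grid.

    positions: (N, 2) metric offsets (x, y) from the agent.
    features:  (N, D) feature vectors for each observation.
    Observations landing in the same cell are averaged.
    """
    gmap = np.zeros((grid, grid, features.shape[1]))
    count = np.zeros((grid, grid, 1))
    # Agent sits at the grid center; convert metric offsets to cell indices.
    idx = np.floor(positions / cell).astype(int) + grid // 2
    for (i, j), f in zip(idx, features):
        if 0 <= i < grid and 0 <= j < grid:   # drop out-of-range points
            gmap[i, j] += f
            count[i, j] += 1
    return gmap / np.maximum(count, 1)        # average features per cell

pos = np.array([[0.0, 0.0], [1.2, -0.6], [10.0, 10.0]])  # last is off-grid
feat = np.ones((3, FEAT))
gmap = build_grid_map(pos, feat)
```

A "dynamically growing" map would additionally enlarge the grid as the agent explores; this sketch keeps a fixed extent for brevity.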
Deep Learning for Logo Detection: A Survey
As logos are created in ever greater numbers, logo detection has gradually become a research hotspot across many domains and tasks. Recent advances in this area are dominated by deep learning-based solutions, employing many datasets, learning strategies, network architectures, and more. This paper reviews advances in applying deep learning techniques to logo detection. First, we provide a comprehensive account of the public datasets designed to facilitate performance evaluation of logo detection algorithms, which tend to be increasingly diverse, challenging, and reflective of real life. Next, we perform an in-depth analysis of existing logo detection strategies and the strengths and weaknesses of each. Subsequently, we summarize the applications of logo detection in various fields, from intelligent transportation and brand monitoring to copyright and trademark compliance. Finally, we analyze the potential challenges and present future directions for the development of logo detection to conclude this survey.